Figure 0.1: Bayes Business School, named after Thomas Bayes, City of London
This block gives an introduction to the fundamentals of modern Bayesian statistics and modelling. It describes the stages involved in Bayesian analysis, from specifying the prior and data models to deriving inference, and on to model checking and refinement.
[📖] Bayesian statistics is an approach to data analysis based on Bayes’ theorem, where available knowledge about parameters in a statistical model is updated with the information in observed data. Unique to Bayesian statistics is that all observed and unobserved quantities in a statistical model are given a joint probability distribution, comprising the prior and data distributions.
The basis of Bayesian statistics was first described in a 1763 essay on inverse probability, written by Reverend Thomas Bayes and published posthumously by Richard Price, addressing how to determine the probability of a future event solely on the basis of past events. It was not until 1825 that Pierre Simon Laplace published the theorem we now know as Bayes’ theorem. Although the ideas of inverse probability and Bayes’ theorem are longstanding in mathematics, these tools became prominent in applied statistics only in the past 50 years.
The modern Bayesian approach is the process of fitting a statistical model to a set of sample data and summarizing the result by a probability distribution on the parameters of the model, and furthermore on unobserved quantities such as predictions for new observations.
A primary motivation for Bayesian thinking is that it facilitates a common-sense interpretation of statistical conclusions. For instance, a Bayesian (probability) interval for an unknown quantity of interest can be directly regarded as having a high probability of containing the unknown quantity, in contrast to a frequentist (confidence) interval, which may strictly be interpreted only in relation to a sequence of similar inferences that might be made in repeated practice. Recently in applied statistics, increased emphasis has been placed on interval estimation rather than hypothesis testing, and this provides a strong impetus to the Bayesian viewpoint, since it seems likely that most users of standard confidence intervals give them a common-sense Bayesian interpretation.
Bayesian Workflow - The general framework of Bayesian statistics involves three sequential steps: Prior, Likelihood, and Posterior. Specifically, capturing available knowledge about a given parameter in a statistical model via the prior distribution, which is typically determined before data collection; determining the likelihood function using the information about the parameters available in the observed data; and combining both the prior distribution and the likelihood function using Bayes’ theorem in the form of the posterior distribution.
Prior Distribution. Beliefs held by researchers about the parameters in a statistical model before seeing the data, expressed as probability distributions.
Likelihood Function. The conditional probability distribution of the observed data given the parameters, defined up to a constant.
Posterior Distribution. A way to summarize one’s updated knowledge, balancing prior knowledge with observed data. It is used to conduct inferences.
Figure 1.1: The Bayesian research cycle
Given the important roles that the prior and the likelihood have in determining the posterior, it is imperative that these steps be conducted with care.
The central feature of Bayesian statistics, the direct quantification of uncertainty, means that there is no impediment in principle to fitting models with many parameters and complicated multilayered probability specifications.
The foundation of Bayesian statistics is to treat the model parameters as uncertain (random) quantities, while using the observed sample data to update the uncertainty about them.
As general notation, we let \(\mathbf{\theta}\) denote unobservable vector quantities or population parameters of interest (such as the probabilities of survival under each treatment for randomly chosen members of the population in the example of the clinical trial) and \(\mathbf{y}\) denote the observed data (such as the numbers of survivors and deaths in each treatment group). In general these symbols represent multivariate random quantities. We assume there are \(n\) observations, each with \(m\) variables.
Given sample observations \(\mathbf{y} \in \mathbb{R}^{n \times m}\) from a statistical model with parameters \(\mathbf{\theta}\), the prior density \(q(\mathbf{\theta})\) is a distribution for \(\mathbf{\theta}\), and the likelihood \(p(\mathbf{y}|\mathbf{\theta})\) represents the data-generating process. In order to make probability statements about \(\mathbf{\theta}\) given \(\mathbf{y}\), we begin with a model providing a joint probability distribution for \(\mathbf{\theta}\) and \(\mathbf{y}\)
\[ p(\mathbf{\theta}, \mathbf{y}) = p(\mathbf{y}|\mathbf{\theta}) \times q(\mathbf{\theta}) \]
A prior density \(q(\mathbf{\theta})\) is updated after observing the random sample realization \(\mathbf{y} = (y_1, ...,y_n)\), yielding the posterior density function \(q(\mathbf{\theta} | \mathbf{y})\) for the model parameters \(\mathbf{\theta}\). The basic property of conditional probability known as Bayes’ rule yields the posterior density:
\[ q(\mathbf{\theta} | \mathbf{y}) = \frac{p(\mathbf{\theta}, \mathbf{y})}{ p(\mathbf{y})} = \frac{ p(\mathbf{y}|\mathbf{\theta}) \times q(\mathbf{\theta}) }{ p(\mathbf{y})} \]
where \(p(\mathbf{y}) = \int_{\mathbf{\theta}} p(\mathbf{y}|\mathbf{\theta}) \times q(\mathbf{\theta}) \; d \mathbf{\theta}\). This marginal distribution of the data does not depend on \(\mathbf{\theta}\) for fixed \(\mathbf{y}\) and can therefore be treated as a constant.
In light of Bayes’ theorem above, we can derive the unnormalized posterior distribution:
\[ q(\mathbf{\theta} | \mathbf{y}) \propto p(\mathbf{y}|\mathbf{\theta}) \times q(\mathbf{\theta}) \]
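As a concrete illustration of this proportionality, the unnormalized posterior can be evaluated on a grid and normalized numerically. The sketch below assumes a hypothetical binomial experiment (7 successes in 10 trials) with a Beta(2, 2) prior; these numbers are illustrative, not from the text:

```python
import numpy as np

# Hypothetical data: 7 successes in 10 Bernoulli trials (illustrative values).
n, k = 10, 7

# Grid of candidate parameter values theta in (0, 1).
theta = np.linspace(0.001, 0.999, 999)

# Prior q(theta): Beta(2, 2) density, up to a constant.
prior = theta * (1 - theta)

# Likelihood p(y | theta): binomial kernel (the binomial coefficient is constant in theta).
likelihood = theta**k * (1 - theta)**(n - k)

# Unnormalized posterior, then normalize so it sums to one over the grid.
unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()

# Conjugacy check: a Beta(2, 2) prior with binomial data gives Beta(2 + k, 2 + n - k),
# whose mean is (2 + k) / (4 + n) = 9/14, approximately 0.643.
posterior_mean = float((theta * posterior).sum())
print(round(posterior_mean, 3))
```

Because the normalizing constant \(p(\mathbf{y})\) cancels out, only the product of likelihood and prior needs to be evaluated, which is exactly what the proportionality statement expresses.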
Prior distributions \(q(\mathbf{\theta})\), shortened to priors, are first determined. The selection of priors is often viewed as one of the more important choices that a researcher makes when implementing a Bayesian model as it can have a substantial impact on the final results. The appropriateness of the priors being implemented is ascertained using the prior predictive checking process.
Prior distributions play a defining role in Bayesian statistics. Priors can come in many different distributional forms, such as a normal, uniform or Poisson distribution, among others. Priors can have different levels of informativeness; the information reflected in a prior distribution can be anywhere on a continuum from complete uncertainty to relative certainty.
Hyperparameters are the individual parameters that control the amount of uncertainty in the priors. Take a normal prior as an example. This distribution is defined by a mean and a variance that are the hyperparameters for the normal prior, and we can write this distribution as \(\mathcal{N}(\mu_0, \sigma_0^2)\), where \(\mu_0\) represents the mean and \(\sigma_0^2\) represents the variance. A larger variance represents a greater amount of uncertainty surrounding the mean, and vice versa. The mean hyperparameter can be seen as the peak of the distribution.
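To make the role of the variance hyperparameter concrete, the sketch below (with an assumed prior mean \(\mu_0 = 0\) and two illustrative choices of \(\sigma_0\)) compares how much prior probability mass falls within one unit of the mean:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu0 = 0.0  # hyperparameter: prior mean (illustrative value)

# Prior probability mass within +/-1 of the mean for two settings of sigma0.
mass_tight = normal_cdf(1, mu0, 1.0) - normal_cdf(-1, mu0, 1.0)    # sigma0 = 1
mass_vague = normal_cdf(1, mu0, 10.0) - normal_cdf(-1, mu0, 10.0)  # sigma0 = 10

print(round(mass_tight, 3), round(mass_vague, 3))  # 0.683 vs 0.08
```

The tighter prior concentrates about 68% of its mass near \(\mu_0\), whereas the diffuse prior places only about 8% there, spreading the rest over a much wider range of parameter values.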
Prior elicitation is the process by which a suitable prior distribution is constructed. Strategies for prior elicitation include asking an expert or a panel of experts to provide values for the hyperparameters of the prior distribution.
Prior elicitation can also involve implementing data-based priors, where the hyperparameters for the prior are derived from the sample data using methods such as maximum likelihood or sample statistics. [⚠️] These procedures lead to double-dipping, as the same sample data set is used to derive prior distributions and to obtain the posterior. Although data-based priors are relatively common, we do not recommend the use of double-dipping procedures. Instead, a hierarchical modelling strategy can be implemented, where priors depend on hyperparameter values that are data-driven — for example, sample statistics pulled from the sample data — which avoids the direct problems linked to double-dipping.
Prior (un)certainty. An informative prior is one that reflects a high degree of certainty about the model parameters being estimated.
Informative prior reflects a high degree of certainty or knowledge surrounding the population parameters. Hyperparameters are specified to express particular information reflecting a greater degree of certainty about the model parameters being estimated. A researcher may want to use an informative prior when existing information suggests restrictions on the possible range of a particular parameter, or a relationship between parameters, such as a positive but imperfect relationship between susceptibility to various medical conditions.
Weakly informative prior is a prior incorporating some information about the population parameter but that is less certain than an informative prior.
Diffuse prior reflects a great deal of uncertainty about the model parameter. This prior form represents a relatively flat density and does not include specific knowledge of the parameter. A researcher may want to use a diffuse prior when there is a complete lack of certainty surrounding the parameter. [🤔] In this case, the data will largely determine the posterior. Even a seemingly non-informative prior, such as the Jeffreys prior, still provides information about the degree of uncertainty. Therefore, no prior is truly non-informative.
Impact of Priors. Overall, there is no right or wrong prior setting. Regardless of the informativeness of the prior, it is always important to conduct a prior sensitivity analysis to fully understand the influence that the prior settings have on posterior estimates.
When the sample size is small, Bayesian estimation with mildly informative priors is often used, but the prior specification might have a huge effect on the posterior results.
When priors do not conform with the likelihood, this is not necessarily evidence that the prior is not appropriate. It may be that the likelihood is at fault owing to a mis-specified model or biased data. The difference between the prior and the likelihood may also be reflective of variation that is not captured by the prior or likelihood alone.
The subjectivity of priors is highlighted by critics as a potential drawback of Bayesian methods. We argue two distinct points here.
First, many elements of the estimation process are subjective, aside from prior selection, including the model itself and the error assumptions. To place the notion of subjectivity solely on the priors is a misleading distraction from the other elements in the process that are inherently subjective.
Second, priors are not necessarily a point of subjectivity. They can be used as tools to allow for data-informed shrinkage, enact regularization or influence algorithms towards a likely high-density region and improve estimation efficiency.
Because inference based on a Bayesian analysis is subject to the ‘correctness’ of the prior, it is important to carefully check whether the specified model can be considered to be generating the actual data: this is prior predictive checking.
Prior predictive checking is an exercise to improve the understanding of the implications of the specified priors on possible observations. It is not a method for changing the original prior, unless this prior explicitly generates incorrect data.
Box suggested deriving a prior predictive distribution from the specified prior. The prior predictive distribution is a distribution of all possible samples that could occur if the model is true. In theory, a “correct” prior provides a prior predictive distribution similar to the true data-generating distribution.
Prior predictive checking compares the observed data, or statistics of the observed data, with the prior predictive distribution, or statistics of the predictive distribution, and checks their compatibility.
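A minimal simulation-based version of this check, assuming the same hypothetical Beta-Binomial setup used for illustration above (Beta(2, 2) prior, 10 trials, 7 observed successes), draws from the prior predictive distribution and asks how extreme the observed statistic is under it:

```python
import random
random.seed(1)

n, y_obs = 10, 7  # trials and observed successes (illustrative values)

# Prior predictive draws: theta from the Beta(2, 2) prior, then data given theta.
sims = []
for _ in range(5000):
    theta = random.betavariate(2, 2)
    y_rep = sum(random.random() < theta for _ in range(n))
    sims.append(y_rep)

# Compatibility check: how extreme is the observed count under the prior predictive?
tail_prob = sum(y >= y_obs for y in sims) / len(sims)
print(round(tail_prob, 2))
```

A tail probability that is not extreme (here roughly a third of prior predictive replicates are at least as large as the observed count) indicates that the prior and assumed data model can plausibly generate data like those observed.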
Young and Pettit propose using a Bayes factor to compare two priors. The Bayes factor would favour the more precise prior. These approaches leave the determination of prior–data conflict subjective, depending on an arbitrary cut-off value. The data agreement criterion tries to resolve the prior–data conflict determination issue by introducing a clear classification, removing the subjective element of this decision.
Likelihood function \(p(\mathbf{y}|\mathbf{\theta})\) is used in both Bayesian and frequentist inference. In both inference paradigms, its role is to quantify the strength of support the observed data lends to possible value(s) for the unknown parameter(s).
The key difference between Bayesian and frequentist inference is that frequentists do not consider probability statements about the unknown parameters to be useful. [👹] Instead, the unknown parameters are considered to be fixed; the likelihood is the conditional probability distribution \(p(\mathbf{y}|\mathbf{\theta})\) of the data (\(\mathbf{y}\)), given fixed parameters (\(\mathbf{\theta}\)).
In Bayesian inference, unknown parameters are treated as random variables in order to make probability statements about them. The (observed) data are treated as fixed, whereas the parameter values are varied; the likelihood is a function of \(\mathbf{\theta}\) for the fixed data \(\mathbf{y}\).
[📖] The likelihood function in the Bayesian framework hence summarizes the following elements: a statistical model that stochastically generates all of the data, a range of possible values for \(\mathbf{\theta}\), and the observed data \(\mathbf{y}\).
Much of the discussion surrounding Bayesian inference focuses on the choice of priors, and there is a vast literature on potential default priors. The importance of the likelihood is often omitted from the discussion, even though the specified model for the data — represented by the likelihood function — is the foundation for the analysis.
Specifying a likelihood function can be straightforward in simple cases. However, in practice, the underlying data-generating model is not always known. Researchers often naively choose a certain data-generating model out of habit or because they cannot easily change it in the software. Although based on background knowledge, the choice of the statistical data-generating model is subjective and should therefore be well understood, clearly documented and available to the reader.
[❓] How a model can be fitted to data to obtain a posterior distribution, how to select variables and why posterior predictive checking is needed.
Posterior distribution \(q(\mathbf{\theta} | \mathbf{y})\) is the result of the prior distribution in interaction with the assumed probability model for the data in the context of the observed data. Once the statistical model has been defined and the associated likelihood function derived, the next step is to fit the model to the observed data to estimate the unknown parameters of the model.
Model building is an iterative process; any Bayesian model can be viewed as a placeholder that can be improved in response to new data or lack of fit to existing data, or simply through a process of model refinement.
In Bayesian statistics, the focus is on estimating the entire posterior distribution of the model parameters \(\mathbf{\theta}\). This posterior distribution is often summarized with associated point estimates, such as the posterior mean or median, and a credible interval.
The frequentist framework for model fitting focuses on the expected long-term outcomes of an experiment with the intent of producing a single point estimate for model parameters such as the maximum likelihood estimate and associated confidence interval.
Within the Bayesian framework for model fitting, probabilities are assigned to the model parameters, describing the associated uncertainties.
Direct inference on the posterior distribution is typically not possible, as the mathematical equation describing the posterior distribution is usually both very complicated and high-dimensional, with the number of dimensions equal to the number of parameters. In particular, the denominator of the expression for the posterior distribution is a function of only the data, where this function is not available in closed form but expressible only as an analytically intractable integral.
[📖] Note that this intractability of the posterior distribution was the primary practical reason why Bayesian statistics was discarded by many scientists in favour of frequentist statistics.
Markov chain Monte Carlo (MCMC), a technique for sampling from a probability distribution, can be used to fit models to data within the Bayesian paradigm. In particular, MCMC only requires the probability distribution of interest to be specified up to a constant of proportionality, and it scales to high dimensions.
MCMC combines two concepts:
obtaining a set of parameter values from the posterior distribution using the Markov chain;
obtaining a distributional estimate of the posterior and associated statistics with sampled parameters using Monte Carlo integration.
Monte Carlo integration is a technique for estimating integrals using computer simulations of sampled values from a given distribution. Given these sampled parameter values, Monte Carlo integration permits estimation of this distribution using associated empirical estimates.
For example, for distributional summary statistics, such as the mean, variance or symmetric 95% credible interval of a parameter, we estimate these summary statistics using the corresponding sample mean, sample variance, and 2.5% and 97.5% quantile parameter values, respectively.
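A sketch of these empirical summaries, using draws from a known normal distribution as a stand-in for MCMC output (all values are illustrative):

```python
import random
random.seed(0)

# Stand-in for MCMC output: draws from a known N(5, 2^2) distribution, so the
# true summaries are mean 5, variance 4, and a 95% interval of about (1.08, 8.92).
draws = sorted(random.gauss(5.0, 2.0) for _ in range(20000))

# Sample mean and sample variance as Monte Carlo estimates.
post_mean = sum(draws) / len(draws)
post_var = sum((d - post_mean) ** 2 for d in draws) / (len(draws) - 1)

# Symmetric 95% credible interval via the empirical 2.5% and 97.5% quantiles.
lo = draws[int(0.025 * len(draws))]
hi = draws[int(0.975 * len(draws))]

print(round(post_mean, 2), round(post_var, 2), round(lo, 1), round(hi, 1))
```

With genuine MCMC output, `draws` would simply be replaced by the sampled parameter values; the summary computations are unchanged.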
[📖] The marginal posterior distribution of any given parameter can be obtained by kernel density estimation, which uses a non-parametric approach for estimating the associated density from which sampled values have been drawn.
The Markov chain component obtains a set of sampled parameter values from the posterior distribution of interest by constructing a Markov chain with a specified first-order transition kernel, such that the resulting stationary distribution of the Markov chain is equal to this posterior distribution of interest.
If the Markov chain is run sufficiently long to reach its stationary distribution, subsequent realizations of the chain can be regarded as a dependent sample from the posterior distribution, and can be used to obtain the corresponding Monte Carlo estimates.
The Gibbs sampler, the Metropolis–Hastings algorithm, and Hamiltonian Monte Carlo are standard approaches for defining the transition kernel so that the corresponding stationary distribution is the correct posterior distribution.
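A minimal random-walk Metropolis sampler illustrates the idea. The target here is an assumed toy model (standard normal prior, one normal observation \(y = 2\)), chosen only so that the true posterior, \(\mathcal{N}(1, 0.5)\), is known for comparison:

```python
import math
import random
random.seed(42)

def log_unnorm_posterior(theta):
    """Unnormalized log posterior for an assumed toy model:
    theta ~ N(0, 1) prior, one observation y = 2 with y | theta ~ N(theta, 1).
    The exact posterior is N(1, 0.5)."""
    y = 2.0
    return -0.5 * theta**2 - 0.5 * (y - theta)**2

# Random-walk Metropolis: propose theta' ~ N(theta, step^2) and accept with
# probability min(1, target(theta') / target(theta)). Only the unnormalized
# target is needed, because the normalizing constant cancels in the ratio.
theta, step = 0.0, 1.0
samples = []
for _ in range(50000):
    proposal = theta + random.gauss(0.0, step)
    if math.log(random.random()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(theta):
        theta = proposal
    samples.append(theta)

kept = samples[5000:]  # discard burn-in drawn before the chain reaches stationarity
post_mean = sum(kept) / len(kept)
print(round(post_mean, 2))  # close to the true posterior mean, 1
```

This is the sense in which MCMC requires the posterior only up to a constant of proportionality: the acceptance ratio never involves \(p(\mathbf{y})\).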
Once a posterior distribution for a particular model is obtained, it can be used to simulate new data \(\mathbf{\tilde{y}}\) conditional on this distribution, which helps to assess whether the model provides valid predictions that can be used for extrapolating to future events.
A formal posterior predictive checking approach can be taken to evaluate whether the model can be considered a good fit with the data-generating mechanism:
Any parameter-dependent statistic or discrepancy can be used for posterior predictive checking.
The sensitivity of posterior predictive checks is useful because, if realistic models are used, the results are expected to be well calibrated in the long-term average.
[⚠️] These uses of posterior predictive checking should be used with care; there is a risk of over-adjusting and over-refining models to the details of a specific data set.
Posterior predictive distributions can further be used to extrapolate beyond the observed data and make predictions, for example extrapolating a time series into the future.
Formally, after the data \(\mathbf{y}\) has been observed, we can predict an unknown observable \(\mathbf{\tilde{y}}\) from the same process. The distribution of \(\mathbf{\tilde{y}}\) is thus defined as the posterior predictive distribution, posterior because it is conditional on the observed \(\mathbf{y}\), and predictive because it is a prediction for an observable \(\mathbf{\tilde{y}}\).
\[ \begin{aligned} p(\mathbf{\tilde{y}} | \mathbf{y}) &= \int p(\mathbf{\tilde{y}},\mathbf{\theta} | \mathbf{y}) \; d \mathbf{\theta} \\ &= \int p(\mathbf{\tilde{y}} | \mathbf{\theta}, \mathbf{y}) \; q(\mathbf{\theta}| \mathbf{y}) \; d \mathbf{\theta} \\ &= \int p(\mathbf{\tilde{y}} | \mathbf{\theta}) \; q(\mathbf{\theta}| \mathbf{y}) \; d \mathbf{\theta} \end{aligned} \] The second and third lines display the posterior predictive distribution as an average of conditional predictions over the posterior distribution of \(\mathbf{\theta}\). The last step follows from the assumed conditional independence of \(\mathbf{y}\) and \(\mathbf{\tilde{y}}\).
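This integral is easy to approximate by simulation: draw \(\mathbf{\theta}\) from the posterior, then \(\mathbf{\tilde{y}}\) from \(p(\mathbf{\tilde{y}} | \mathbf{\theta})\). The sketch below assumes the illustrative conjugate Beta-Binomial setup used earlier (posterior Beta(9, 5) after 7 successes in 10 trials):

```python
import random
random.seed(3)

# Illustrative conjugate setup: a Beta(2, 2) prior and 7 successes in 10 trials
# give the posterior Beta(9, 5); we predict a new batch of 10 trials.
n_new = 10
draws = []
for _ in range(20000):
    theta = random.betavariate(9, 5)                              # theta ~ q(theta | y)
    y_tilde = sum(random.random() < theta for _ in range(n_new))  # y~ ~ p(y~ | theta)
    draws.append(y_tilde)

# The collected y_tilde values approximate the posterior predictive p(y~ | y).
pred_mean = sum(draws) / len(draws)
print(round(pred_mean, 2))  # E[y~ | y] = n_new * 9/14, roughly 6.43
```

Each simulated \(\mathbf{\tilde{y}}\) averages the conditional prediction over posterior draws of \(\mathbf{\theta}\), so the spread of the draws reflects both parameter uncertainty and sampling variability, exactly as the integral prescribes.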
Based on the posterior distributions for a particular model of interest, posterior predictive distributions can be simulated for observed and future data, naturally becoming more uncertain as they predict further into the future owing to accumulated uncertainty.
To illustrate many aspects of Bayesian statistics we provide an example based on real-life data. Consider an empirical example of a study predicting Wikipedia page views.
Consider the page views for the Wikipedia page on the English Premier League — the highest level of the English professional football league — obtained using the wikipediatrend R package.
The decomposable time-series model, implemented in the prophet R package, allows the estimation of trends with non-periodic changes, holiday effects, weekly seasonality and yearly seasonality effects.
Figure (e) shows the posterior predictive distributions at each time point. Posterior predictive distributions for the time points that fall in the observed data interval on which the posterior distribution is conditioned are displayed in light yellow (50% CI) and dark yellow (95% CI), whereas posterior predictive distributions for future data are presented in light green (50% CI) and dark green (95% CI). [🤔] The posterior predictive distributions for future time points are more uncertain the further they lie into the future, owing to accumulated uncertainty. Notice that increases and decreases in page views are accurately predicted for future page views, with the exception of the increased interest in July 2018, which might relate to the final stage of the FIFA World Cup played at that time.
Figure 3.1: Posterior predictive distribution and predicted future page views based on current observations
The widespread adoption of Bayesian statistics across disciplines is a testament to the power of the Bayesian paradigm for the construction of powerful and flexible statistical models within a rigorous and coherent probability framework. Modern Bayesian practitioners have access to a wealth of knowledge and techniques that allow the creation of bespoke models and computational approaches for particular problems. Probabilistic programming languages, such as Stan, can take away much of the implementation details for many applications, allowing the focus to remain on the fundamentals of modelling and design.
Rather than argue the foundations of statistics, we prefer to concentrate on and highlight the pragmatic advantages of the Bayesian framework, whose flexibility and generality allow it to cope with complex problems.
Driven by the need to support large-scale applications involving data sets of increasing dimensionality and sample numbers, Bayesian concepts have exploited the growth of new technologies centred on deep learning. This includes deep learning programming frameworks, which simplify the use of deep neural networks (DNNs), permitting the construction of more expressive, data-driven models that are immediately amenable to inference techniques using off-the-shelf optimization algorithms and state-of-the-art hardware. In addition to providing a powerful tool to specify flexible and modular generative models, DNNs have been employed to develop new approaches for approximate inference and have stimulated a new paradigm for Bayesian practice that sees the integration of statistical modelling and computation at its core.
The variational autoencoder is an archetypal example. The underlying statistical model is a simple Bayesian hierarchical latent variable model, which maps high-dimensional observations to low-dimensional latent variables, assumed to be normally distributed, through functions defined by DNNs. Variational inference is used to approximate the posterior distribution over the latent variables.
Although the full extent of Bayesian principles has not yet been generalized to all recent developments in artificial intelligence, it is nonetheless a success that Bayesian thinking is deeply embedded and crucial to numerous innovations that have arisen. The next decade is sure to bring a new wave of exciting innovative developments for Bayesian intelligence.
Etz, A., “Introduction to the concept of likelihood and its applications”, [2018], Adv. Methods Practices Psychol. Sci. 1, 60–69.
Harvey, A. C. & Peters, S., “Estimation procedures for structural time series models”, [1990], J. Forecast. 9, 89–108.
Laplace, P. S., “Essai philosophique sur les probabilités”, [1814], Courcier.
Veen, D. & van de Schoot, R., “Bayesian analysis for PhD-delay dataset”, [2020], OSF, Link